requests

Provides a friendly way of working with the HTTP protocol

  • http://www.python-requests.org/en/master/

  • A third-party library; it must be installed manually:

    pip install requests

  • A quick example:

    import requests
    response = requests.get('http://www.python-requests.org/en/master/')
    print(response.status_code)
    print(response.text)
    '''
    200

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

    <html xmlns="http://www.w3.org/1999/xhtml" lang="en">
    <head>..................
    '''

Two important objects

  • Request object

  • Response object

    | Five most-used Response attributes | Description |
    | ------------------- | ------------------------------------------------------------ |
    | r.status_code | HTTP status code of the response; 200 means success |
    | r.text | Response body, decoded according to the encoding attribute |
    | r.encoding | Encoding guessed from the charset field of the HTTP headers; defaults to ISO-8859-1 if that field is absent |
    | r.apparent_encoding | Encoding inferred from the content itself (a fallback) |
    | r.content | Response body as raw bytes |

    | Common Response method | Description |
    | -------------------- | --------------------------------------- |
    | r.raise_for_status() | Raises requests.HTTPError if the response is an error status (4xx/5xx) |
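
    A minimal sketch tying these attributes and raise_for_status together, reusing the docs URL from the example above:

    import requests

    r = requests.get('http://www.python-requests.org/en/master/')
    r.raise_for_status()              # requests.HTTPError on 4xx/5xx
    print(r.status_code)              # e.g. 200
    print(r.encoding)                 # guessed from the HTTP headers
    print(r.apparent_encoding)        # inferred from the body itself
    r.encoding = r.apparent_encoding  # prefer the content-based guess
    print(r.text[:100])               # decoded body, first 100 characters
    print(len(r.content))             # raw body size in bytes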

Six kinds of exceptions

Network connections are not always stable and reliable,

so it is best to pair try-except with raise_for_status to handle exceptions.

| Exception | Description |
| ------------------------- | ------------------------------------------------------ |
| requests.ConnectionError | Network connection error, e.g. DNS lookup failure or connection refused |
| requests.HTTPError | HTTP error |
| requests.URLRequired | A valid URL is missing |
| requests.TooManyRedirects | The maximum number of redirects was exceeded |
| requests.ConnectTimeout | Timed out while connecting to the remote server |
| requests.Timeout | The request for the URL timed out |
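
A minimal sketch of this pattern, catching the exceptions above (the URL and timeout are arbitrary):

import requests

try:
    r = requests.get('http://www.example.com', timeout=5)
    r.raise_for_status()  # turns 4xx/5xx responses into requests.HTTPError
except requests.Timeout:
    print('the request timed out')
except requests.ConnectionError:
    print('the network connection failed')
except requests.HTTPError as e:
    print('HTTP error:', e)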

Seven main methods

| Method | Description |
| ------------------ | --------------------------------------------------- |
| requests.request() | Constructs a request; the base method underlying all the methods below |
| requests.get() | Corresponds to HTTP GET |
| requests.head() | Corresponds to HTTP HEAD |
| requests.post() | Corresponds to HTTP POST |
| requests.put() | Corresponds to HTTP PUT |
| requests.patch() | Corresponds to HTTP PATCH |
| requests.delete() | Corresponds to HTTP DELETE |

  • The request method

    The most basic and most central method; all the others are wrappers around it.

    requests.request(method, url, **kwargs)

    method: the request method, corresponding to the seven verbs (GET, HEAD, POST, PUT, ...)

    url: the network resource to access

    **kwargs: 13 keyword arguments controlling the request

    | Keyword argument | Description |
    | --------------- | ---------------------------------------------- |
    | params | Dict or byte sequence, appended to the URL as query parameters |
    | data | Dict, byte sequence, or file object, sent as the request body |
    | json | Data in JSON format, sent as the request body |
    | headers | Dict of custom HTTP headers |
    | cookies | Dict or CookieJar, cookies to send with the request |
    | auth | Tuple, enables HTTP authentication |
    | files | Dict, for transferring files |
    | timeout | Timeout in seconds |
    | proxies | Dict of proxy servers to route through; may include login credentials |
    | allow_redirects | True/False, default True: whether to follow redirects |
    | stream | True/False, default False: if True, the body is not downloaded immediately |
    | verify | True/False, default True: whether to verify the SSL certificate |
    | cert | Path to a local SSL certificate |

    kv = {'key1': 'value1', 'key2': 'value2'}
    st = 'some content'

    # params
    r = requests.request('GET', 'http://python123.io/ws', params=kv)
    print(r.url)
    '''
    https://python123.io/ws?key1=value1&key2=value2
    '''

    # data
    r = requests.request('POST', 'http://python123.io/ws', data=kv)
    r = requests.request('POST', 'http://python123.io/ws', data=st)

    # json
    r = requests.request('POST', 'http://python123.io/ws', json=kv)

    # headers
    hd = {'user-agent': 'Chrome/10'}
    r = requests.request('POST', 'http://python123.io/ws', headers=hd)

    # files
    fs = {'file': open('data.xls', 'rb')}
    r = requests.request('POST', 'http://python123.io/ws', files=fs)

    # timeout
    r = requests.request('GET', 'http://www.baidu.com', timeout=10)

    # proxies
    pxs = {
        'http': 'http://user:pass@10.10.10.1:1234',
        'https': 'https://10.10.10.1:4321',
    }
    r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)
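
    The examples above skip the auth argument from the table; a minimal sketch of HTTP Basic authentication against httpbin (the credentials are placeholders):

    # auth: HTTP Basic authentication via a (user, password) tuple
    r = requests.request('GET', 'http://httpbin.org/basic-auth/user/passwd',
                         auth=('user', 'passwd'))  # placeholder credentials
    print(r.status_code)
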
  • The get method

    The most commonly used method.

    r = requests.get(url, params=None, **kwargs)

    url is the link of the page to fetch

    params holds extra parameters for the URL, as a dict or byte sequence

    **kwargs are the remaining 12 keyword arguments, as above

    requests.get is a wrapper around requests.request

    r is a Response object containing everything the server returned.
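
    A minimal sketch (httpbin echoes the request, so the query shows up in the URL):

    import requests

    r = requests.get('http://httpbin.org/get', params={'q': 'python'})
    print(r.url)          # http://httpbin.org/get?q=python
    print(r.status_code)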

  • The head method

    requests.head(url, **kwargs)

    # a HEAD request fetches the headers only, so the body is empty
    r = requests.head('http://www.python-requests.org/en/master/')
    print(r.headers)
    print(r.text)
    '''
    {'Content-Length': '0', 'Content-Type': 'text/html', 'Content-Encoding': 'gzip', 'Last-Modified': 'Thu, 13 Dec 2018 21:34:51 GMT', 'ETag': 'W/"5c12d07b-c7e2"', 'Vary': 'Accept-Encoding', 'Server': 'nginx', 'X-Cname-TryFiles': 'True', 'X-Served': 'Nginx', 'X-Deity': 'web01', 'Date': 'Fri, 22 Feb 2019 10:23:20 GMT'}
    ''
    '''
  • The post method

    requests.post(url, data=None, json=None, **kwargs)

    # POST a dict to the URL: it is automatically encoded as a form
    payload = {'key1': 'value1', 'key2': 'value2'}
    r = requests.post('http://httpbin.org/post', data=payload)
    print(r.text)
    '''
    {
      "args": {},
      "data": "",
      "files": {},
      "form": {
        "key1": "value1",
        "key2": "value2"
      },
      "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Content-Length": "23",
        "Content-Type": "application/x-www-form-urlencoded",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.21.0"
      },
      "json": null,
      "origin": "171.125.82.158, 171.125.82.158",
      "url": "https://httpbin.org/post"
    }
    '''
    # POST a string: it is sent as raw data
    r = requests.post('http://httpbin.org/post', data='ABC')
    print(r.text)
    '''
    '{\n "args": {}, \n "data": "ABC", \n "files": {}, \n "form": {}, \n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate", \n "Content-Length": "3", \n "Host": "httpbin.org", \n "User-Agent": "python-requests/2.21.0"\n }, \n "json": null, \n "origin": "171.125.82.158, 171.125.82.158", \n "url": "https://httpbin.org/post"\n}\n'
    '''
    # POST a form (url here is a placeholder)
    data = {'first name': 'xxxx', 'last name': 'hhh'}

    r = requests.post(url, data)
    print(r.text)
    # POST an image file (url here is a placeholder)
    file = {'uploadFile': open('./image.jpg', 'rb')}
    r = requests.post(url, files=file)
    print(r.text)
    # using cookies: reuse the cookies set by the first response
    payload = {'first name': 'xxxx', 'last name': 'hhh'}
    r = requests.post(
        'http://pythonscraping.com/pages/cookies/welcome.php',
        data=payload
    )
    print(r.cookies.get_dict())
    r = requests.get(
        'http://pythonscraping.com/pages/cookies/profile.php',
        cookies=r.cookies
    )
    print(r.text)
    # using a Session, which carries cookies across requests automatically
    session = requests.Session()
    payload = {'first name': 'xxxx', 'last name': 'hhh'}
    r = session.post(
        'http://pythonscraping.com/pages/cookies/welcome.php',
        data=payload
    )
    print(r.cookies.get_dict())
    r = session.get(
        'http://pythonscraping.com/pages/cookies/profile.php'
    )
    print(r.text)
  • The put method

    requests.put(url, data=None, **kwargs)

    payload = {'key1': 'value1', 'key2': 'value2'}
    r = requests.put('http://httpbin.org/put', data=payload)
    print(r.text)
    '''
    {
      "args": {},
      "data": "",
      "files": {},
      "form": {
        "key1": "value1",
        "key2": "value2"
      },
      "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Content-Length": "23",
        "Content-Type": "application/x-www-form-urlencoded",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.21.0"
      },
      "json": null,
      "origin": "171.125.82.158, 171.125.82.158",
      "url": "https://httpbin.org/put"
    }
    '''
  • The patch method

    requests.patch(url, data=None, **kwargs)
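
    PATCH submits a partial modification of a resource, so only the supplied fields change; a minimal sketch against httpbin (the field name is arbitrary):

    # PATCH sends a partial update; httpbin echoes it back
    r = requests.patch('http://httpbin.org/patch', data={'key1': 'new value'})
    print(r.status_code)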

  • The delete method

    requests.delete(url, **kwargs)
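
    Likewise, a minimal sketch (httpbin simply echoes the request):

    # DELETE requests removal of the resource at the URL
    r = requests.delete('http://httpbin.org/delete')
    print(r.status_code)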

Examples

Scraping a JD.com product page

import requests

url = 'https://item.jd.com/100002338246.html'
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text)
except requests.RequestException:
    print('failed to fetch the page')

Scraping an Amazon product page

# add a browser User-Agent header
import requests

url = 'https://www.amazon.cn/dp/B06VTBN9ML/'
try:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'}
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.request.headers)
    print(len(r.text))
except requests.RequestException:
    print('failed to fetch the page')

Submitting a search keyword to Baidu and 360

'''
Baidu keyword interface:
http://www.baidu.com/s?wd=keyword
360 keyword interface:
http://www.so.com/s?q=keyword
'''

import requests

url_baidu = 'http://www.baidu.com/s'
url_360 = 'http://www.so.com/s'

try:
    r = requests.get(url_baidu, params={'wd': 'python'})
    # r = requests.get(url_360, params={'q': 'python'})
    r.raise_for_status()
    print(r.request.url)
    r.encoding = r.apparent_encoding
    print(len(r.text))
except requests.RequestException:
    print('failed to fetch the page')

Downloading and saving an image from the web

import requests
import os.path

pic_url = 'https://img.88tph.com/5e/e4/XuQeXjTUEemOjAAWPghzvA-0.jpg'
filename = pic_url.split('/')[-1]
path = os.path.join('d:/', filename)
try:
    r = requests.get(pic_url)
    r.raise_for_status()
    with open(path, 'wb') as f:
        f.write(r.content)
    print('file %s downloaded' % path)
except Exception:
    print('image download failed; make sure the URL and path are valid')
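
For a large file, the stream argument from the table above avoids holding the whole body in memory; a minimal sketch reusing pic_url and path from the block above:

# stream=True defers the download; read the body in chunks instead
r = requests.get(pic_url, stream=True)
with open(path, 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)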

Looking up the location of an IP address

'''
The ip138 site's API:
http://www.ip138.com/ips138.asp?ip=xxx.xxx.xxx.xxx
'''

import requests

url = 'http://www.ip138.com/ips138.asp'
ip = {'ip': '192.168.1.1'}
try:
    r = requests.get(url, params=ip)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text)
except requests.RequestException:
    print('failed to fetch the page')